AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).
A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better target marketing to increase the success ratio.
As a data scientist at AllLife Bank, you have to build a model that will help the marketing department identify potential customers who have a higher probability of purchasing the loan.
# this will help in making the Python code more structured automatically (good coding practice)
%load_ext nb_black
# Library to suppress warnings or deprecation notes
import warnings
warnings.filterwarnings("ignore")
# Libraries to help with reading and manipulating data
import pandas as pd
import numpy as np
# Library to split data
from sklearn.model_selection import train_test_split
# Libraries to help with data visualization
import matplotlib.pyplot as plt
import seaborn as sns
# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)
# Libraries to build decision tree classifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
# To tune different models
from sklearn.model_selection import GridSearchCV
# To get different metric scores
# (plot_confusion_matrix was removed from recent scikit-learn; ConfusionMatrixDisplay replaces it)
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    confusion_matrix,
    ConfusionMatrixDisplay,  # to plot the confusion matrix
    classification_report,
    make_scorer,
    roc_auc_score,  # AUC ROC
    roc_curve,
    precision_recall_curve,
)
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn import metrics
from sklearn.linear_model import LogisticRegression # to build the model
from sklearn.tree import DecisionTreeClassifier # to build the model
# To suppress numerical display in scientific notation
pd.set_option("display.float_format", lambda x: "%.5f" % x)
pd.set_option("display.max_rows", 300)
pd.set_option("display.max_colwidth", 400)
# set the background for the graphs
plt.style.use("ggplot")
import statsmodels.api as sm
from scipy import stats
from sklearn.neighbors import KNeighborsClassifier
# To model the Gaussian Naive Bayes classifier
from sklearn.naive_bayes import GaussianNB
from sklearn import preprocessing
data = pd.read_csv("Loan_Modelling.csv")
data.head()
| ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Personal_Loan | Securities_Account | CD_Account | Online | CreditCard | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 49 | 91107 | 4 | 1.60000 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 1 | 2 | 45 | 19 | 34 | 90089 | 3 | 1.50000 | 1 | 0 | 0 | 1 | 0 | 0 | 0 |
| 2 | 3 | 39 | 15 | 11 | 94720 | 1 | 1.00000 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 35 | 9 | 100 | 94112 | 1 | 2.70000 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 35 | 8 | 45 | 91330 | 4 | 1.00000 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |
The target column, Personal_Loan, is in the middle of the dataframe; it is good practice to move the target column to the end of the dataframe.
personal_loan = data["Personal_Loan"]
data.drop(labels=["Personal_Loan"], axis=1, inplace=True)
data.insert(13, "Personal_Loan", personal_loan)
data.head()
| ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Securities_Account | CD_Account | Online | CreditCard | Personal_Loan | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 49 | 91107 | 4 | 1.60000 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 2 | 45 | 19 | 34 | 90089 | 3 | 1.50000 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 3 | 39 | 15 | 11 | 94720 | 1 | 1.00000 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 35 | 9 | 100 | 94112 | 1 | 2.70000 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 35 | 8 | 45 | 91330 | 4 | 1.00000 | 2 | 0 | 0 | 0 | 0 | 1 | 0 |
rows_count, columns_count = data.shape
print("Total number of rows :", rows_count)
print("Total number of columns :", columns_count)
Total number of rows : 5000
Total number of columns : 14
The dataset has 5000 observations (rows) and 14 attributes (columns)
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   ID                  5000 non-null   int64
 1   Age                 5000 non-null   int64
 2   Experience          5000 non-null   int64
 3   Income              5000 non-null   int64
 4   ZIPCode             5000 non-null   int64
 5   Family              5000 non-null   int64
 6   CCAvg               5000 non-null   float64
 7   Education           5000 non-null   int64
 8   Mortgage            5000 non-null   int64
 9   Securities_Account  5000 non-null   int64
 10  CD_Account          5000 non-null   int64
 11  Online              5000 non-null   int64
 12  CreditCard          5000 non-null   int64
 13  Personal_Loan       5000 non-null   int64
dtypes: float64(1), int64(13)
memory usage: 547.0 KB
def missing_check(df):
total = (
df.isnull().sum().sort_values(ascending=False)
) # Total number of null values
percent = (df.isnull().sum() / df.isnull().count()).sort_values(
ascending=False
) # Percentage of values that are null
missing_data = pd.concat(
[total, percent], axis=1, keys=["Total", "Percent"]
) # Putting the above two together
return missing_data
missing_check(data)
| Total | Percent | |
|---|---|---|
| ID | 0 | 0.00000 |
| Age | 0 | 0.00000 |
| Experience | 0 | 0.00000 |
| Income | 0 | 0.00000 |
| ZIPCode | 0 | 0.00000 |
| Family | 0 | 0.00000 |
| CCAvg | 0 | 0.00000 |
| Education | 0 | 0.00000 |
| Mortgage | 0 | 0.00000 |
| Securities_Account | 0 | 0.00000 |
| CD_Account | 0 | 0.00000 |
| Online | 0 | 0.00000 |
| CreditCard | 0 | 0.00000 |
| Personal_Loan | 0 | 0.00000 |
data.isnull().sum().sum()
0
There are no missing values in the dataset
data.describe().transpose()
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| ID | 5000.00000 | 2500.50000 | 1443.52000 | 1.00000 | 1250.75000 | 2500.50000 | 3750.25000 | 5000.00000 |
| Age | 5000.00000 | 45.33840 | 11.46317 | 23.00000 | 35.00000 | 45.00000 | 55.00000 | 67.00000 |
| Experience | 5000.00000 | 20.10460 | 11.46795 | -3.00000 | 10.00000 | 20.00000 | 30.00000 | 43.00000 |
| Income | 5000.00000 | 73.77420 | 46.03373 | 8.00000 | 39.00000 | 64.00000 | 98.00000 | 224.00000 |
| ZIPCode | 5000.00000 | 93169.25700 | 1759.45509 | 90005.00000 | 91911.00000 | 93437.00000 | 94608.00000 | 96651.00000 |
| Family | 5000.00000 | 2.39640 | 1.14766 | 1.00000 | 1.00000 | 2.00000 | 3.00000 | 4.00000 |
| CCAvg | 5000.00000 | 1.93794 | 1.74766 | 0.00000 | 0.70000 | 1.50000 | 2.50000 | 10.00000 |
| Education | 5000.00000 | 1.88100 | 0.83987 | 1.00000 | 1.00000 | 2.00000 | 3.00000 | 3.00000 |
| Mortgage | 5000.00000 | 56.49880 | 101.71380 | 0.00000 | 0.00000 | 0.00000 | 101.00000 | 635.00000 |
| Securities_Account | 5000.00000 | 0.10440 | 0.30581 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 |
| CD_Account | 5000.00000 | 0.06040 | 0.23825 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 |
| Online | 5000.00000 | 0.59680 | 0.49059 | 0.00000 | 0.00000 | 1.00000 | 1.00000 | 1.00000 |
| CreditCard | 5000.00000 | 0.29400 | 0.45564 | 0.00000 | 0.00000 | 0.00000 | 1.00000 | 1.00000 |
| Personal_Loan | 5000.00000 | 0.09600 | 0.29462 | 0.00000 | 0.00000 | 0.00000 | 0.00000 | 1.00000 |
Observations:
Important: the minimum experience of a customer is -3, which indicates the presence of erroneous data, because experience cannot be negative.
# Checking for duplicate rows
data[data.duplicated()].count()
ID                    0
Age                   0
Experience            0
Income                0
ZIPCode               0
Family                0
CCAvg                 0
Education             0
Mortgage              0
Securities_Account    0
CD_Account            0
Online                0
CreditCard            0
Personal_Loan         0
dtype: int64
From the summary table we can see that the minimum Experience is a negative number, which represents bad data and needs to be fixed, either by dropping those records or by imputation.
I am going to check the Experience column for negative values.
# Total number of observations with negative values in their Experience column
negExp = data.Experience < 0
negExp.value_counts()
False    4948
True       52
Name: Experience, dtype: int64
52 observations of the dataset have negative values in the Experience column.
# Checking all the negative values present in the Experience column
data[data["Experience"] < 0]["Experience"].value_counts()
-1    33
-2    15
-3     4
Name: Experience, dtype: int64
Of these 52 observations, 33 have an experience of -1, 15 have -2, and 4 have -3.
I am going to look at the relationship of Experience with other quantitative attributes such as Age, Income, CCAvg, and Mortgage.
quantitativeAttr = ["Age", "Income", "CCAvg", "Mortgage"]
# Create an instance of the PairGrid class.
grid = sns.PairGrid(data=data, y_vars="Experience", x_vars=quantitativeAttr, height=4)
grid.map(sns.regplot, line_kws={"color": "Orange"})
<seaborn.axisgrid.PairGrid at 0x1ac7b630910>
Observation:
Final decision for imputation: I will replace each negative Experience value with the median of the positive Experience values associated with the same Age and Education values.
# Create two different dataframes with records where experience value is greater than 0 and lesser than 0 respectively
# Get the list of Customer IDs from the dataframe containing records with negative experince values
# Iterating over the Customer ID list
# Get the Age and Education level for the corresponding ID from the negative experience dataframe
# Filter the records form the positive experience dataframe based on the obtained age and education value
# calculate the median experience value from the filtered dataframe and store in "experience" variable
# if the filtered dataframe is empty, filter the negative experience dataframe and obtain the median experience value
# Replace the negative experience with the absolute value of the median "experience" value
df_positive_experience = data[data["Experience"] > 0]
df_negative_experience = data[data["Experience"] < 0]
negative_experience_id_list = df_negative_experience["ID"].tolist()
for customer_id in negative_experience_id_list:
    age = data.loc[data["ID"] == customer_id, "Age"].iloc[0]
    education = data.loc[data["ID"] == customer_id, "Education"].iloc[0]
    positive_experience_filtered = df_positive_experience[
        (df_positive_experience["Age"] == age)
        & (df_positive_experience["Education"] == education)
    ]
    if positive_experience_filtered.empty:
        negative_experience_filtered = df_negative_experience[
            (df_negative_experience["Age"] == age)
            & (df_negative_experience["Education"] == education)
        ]
        experience = round(negative_experience_filtered["Experience"].median())
    else:
        experience = round(positive_experience_filtered["Experience"].median())
    data.loc[data.ID == customer_id, "Experience"] = abs(experience)
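The row-by-row loop above can also be expressed with a vectorized groupby-transform. A minimal sketch on a toy frame (hypothetical values, not the bank data), assuming the same rule of replacing negatives with the median non-negative Experience for the same (Age, Education) pair:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the bank data (hypothetical values)
df = pd.DataFrame(
    {
        "Age": [25, 25, 25, 40],
        "Education": [1, 1, 1, 2],
        "Experience": [-1, 2, 4, 18],
    }
)

# Median of the non-negative Experience values within each (Age, Education) group
group_median = (
    df["Experience"]
    .where(df["Experience"] >= 0)  # mask negatives so they don't affect the median
    .groupby([df["Age"], df["Education"]])
    .transform("median")
)

# Replace negative entries with the rounded group median
df["Experience"] = np.where(
    df["Experience"] < 0, group_median.round(), df["Experience"]
).astype(int)

print(df["Experience"].tolist())
```

This sketch skips the fallback branch for (Age, Education) groups that contain no non-negative values, which the loop above handles explicitly.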
I am going to check if there are still any records with negative experience values.
data[data["Experience"] < 0]["Experience"].count()
0
data.Experience.describe()
count   5000.00000
mean      20.13480
std       11.41486
min        0.00000
25%       10.00000
50%       20.00000
75%       30.00000
max       43.00000
Name: Experience, dtype: float64
The minimum value in Experience column is 0.00 which was -3.00 before error-fixing.
sns.histplot(data["Age"], kde=True, stat="density")  # distplot is deprecated in recent seaborn
plt.title("Age Distribution with KDE")
Text(0.5, 1.0, 'Age Distribution with KDE')
The above plot shows a frequency distribution superimposed on a histogram for the Age attribute.
This distribution approximately follows a normal distribution.
sns.histplot(data["Experience"], kde=True, stat="density")
plt.title("Experience Distribution with KDE")
Text(0.5, 1.0, 'Experience Distribution with KDE')
sns.histplot(data["Income"], kde=True, stat="density")
plt.title("Income Distribution with KDE")
Text(0.5, 1.0, 'Income Distribution with KDE')
Skewness score
data["Income"].skew()
0.8413386072610816
The above distribution for the Income attribute is positively skewed (right-skewed : tail goes to the right) with a skewness score of 0.8413.
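If this skew mattered for a downstream model, a log transform is a common remedy. A quick sketch on synthetic right-skewed data (a lognormal sample, not the actual Income column):

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed data, standing in for a column like Income
rng = np.random.default_rng(0)
income = pd.Series(rng.lognormal(mean=4, sigma=0.5, size=5000))

# The log transform pulls the long right tail in, bringing skewness close to 0
print(round(income.skew(), 2), round(np.log1p(income).skew(), 2))
```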
sns.histplot(data["ZIPCode"], kde=True, stat="density")
plt.title("ZIPCode Distribution with KDE")
Text(0.5, 1.0, 'ZIPCode Distribution with KDE')
sns.countplot(x="Family", data=data)
plt.title("Family Distribution with count for each family size")
Text(0.5, 1.0, 'Family Distribution with count for each family size')
Observations:
##### CCAvg
sns.histplot(data["CCAvg"], kde=True, stat="density")
plt.title("CCAvg Distribution with KDE")
Text(0.5, 1.0, 'CCAvg Distribution with KDE')
The above distribution for the CCAvg attribute is highly positively skewed (right-skewed : tail goes to the right) with a skewness score of 1.5984.
Most customers' average monthly spending on credit cards is between $0 and $2,500; very few customers spend more than $8,000 a month on average.
sns.countplot(x="Education", data=data)
plt.title("Education Distribution with count for each education level")
Text(0.5, 1.0, 'Education Distribution with count for each education level')
There are more undergraduate-level customers than graduate and advanced/professional customers.
sns.histplot(data["Mortgage"], kde=True, stat="density")
plt.title("Mortgage Distribution with KDE")
Text(0.5, 1.0, 'Mortgage Distribution with KDE')
Skewness score
data["Mortgage"].skew()
2.1040023191079444
The above distribution for the Mortgage attribute is highly positively skewed (right-skewed : tail goes to the right) with a skewness score of 2.1040.
Most of the customers do not have any mortgage. Of those who do, most owe between $80,000 and $150,000, and very few owe more than $600,000.
sns.countplot(x="Securities_Account", data=data)
plt.title("Securities_Account Distribution with a Yes/No count")
Text(0.5, 1.0, 'Securities_Account Distribution with a Yes/No count')
Most of the customers do not hold a securities account with the bank.
sns.countplot(x="CD_Account", data=data)
plt.title("CD_Account Distribution with a Yes/No count")
Text(0.5, 1.0, 'CD_Account Distribution with a Yes/No count')
Most of the customers do not hold a certificate of deposit (CD) account with the bank.
sns.countplot(x="CreditCard", data=data)
plt.title("CreditCard Distribution with a Yes/No count")
Text(0.5, 1.0, 'CreditCard Distribution with a Yes/No count')
The number of customers who do not use a credit card issued by the bank is almost double the number who do.
# Cast to float so a KDE can be drawn over the binary values
df_cc = data["CreditCard"].astype("float64")
sns.histplot(df_cc, kde=True, stat="density")
<AxesSubplot:xlabel='CreditCard', ylabel='Density'>
The CreditCard column follows a Bernoulli distribution.
Important: this is the case for all binary categorical variables in the dataset, i.e. those that take exactly two values such as 0/1 or Yes/No.
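As a sanity check, a Bernoulli(p) variable has variance p(1 − p); plugging the CreditCard mean from the describe() table above into this formula reproduces the reported standard deviation up to rounding:

```python
import math

# Bernoulli(p): Var = p * (1 - p). CreditCard summary from describe():
# mean = 0.294, sample std = 0.45564, n = 5000
p, n = 0.294, 5000
population_std = math.sqrt(p * (1 - p))               # ddof = 0
sample_std = population_std * math.sqrt(n / (n - 1))  # ddof = 1, as pandas reports
print(round(sample_std, 5))
```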
# Calculate the PL acceptance ratio from the target column
n_true = len(data.loc[data["Personal_Loan"] == True])
n_false = len(data.loc[data["Personal_Loan"] == False])
print(
"Number of customers who accepted the PL offer: {0} ({1:2.2f}%)".format(
n_true, (n_true / (n_true + n_false)) * 100
)
)
print(
"Number of customers who did not accept the PL offer: {0} ({1:2.2f}%)".format(
n_false, (n_false / (n_true + n_false)) * 100
)
)
Number of customers who accepted the PL offer: 480 (9.60%)
Number of customers who did not accept the PL offer: 4520 (90.40%)
So 9.60% of the customers in the current dataset accepted the personal loan offer, while the remaining 90.40% did not.
loan_acceptance_count = pd.DataFrame(data["Personal_Loan"].value_counts()).reset_index()
loan_acceptance_count.columns = ["Labels", "Personal_Loan"]
loan_acceptance_count
| Labels | Personal_Loan | |
|---|---|---|
| 0 | 0 | 4520 |
| 1 | 1 | 480 |
# Creating dataset
pie_labels = loan_acceptance_count["Labels"]
pie_labels = ["Not Accepted" if x == 0 else "Accepted" for x in pie_labels]
pie_data = loan_acceptance_count["Personal_Loan"]
# Creating explode data
explode = (0, 0.15)
# Wedge properties
wp = {"linewidth": 1, "edgecolor": "#666666"}
# Creating autopct arguments
def func(pct, allvalues):
absolute = int(np.round(pct / 100.0 * np.sum(allvalues)))
return "{:.1f}%\n({:d})".format(pct, absolute)
# Creating plot
fig, ax = plt.subplots(figsize=(10, 5))
ax.pie(
pie_data,
autopct=lambda pct: func(pct, pie_data),
explode=explode,
labels=pie_labels,
shadow=True,
startangle=70,
wedgeprops=wp,
)
ax.axis("equal") # Equal aspect ratio ensures that pie is drawn as a circle.
plt.title("Personal Loan Acceptance Percentage", size=19)
plt.show()
Observation:
From the above pie chart I can see that the current dataset is heavily imbalanced toward customers not accepting the personal loan offer.
Hence a model trained on it will tend to be better at predicting which customers will not accept the personal loan, whereas my goal is to identify the customers who will accept it based on the given features.
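Given this imbalance, it is worth preserving the class ratio when splitting the data; `train_test_split` supports this via its `stratify` argument (the split performed later in this notebook does not stratify). A sketch on synthetic labels mirroring the 90.4%/9.6% split:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels mirroring the 90.4% / 9.6% split in the dataset
y = np.array([0] * 4520 + [1] * 480)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder features

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.30, random_state=1, stratify=y
)

# Both partitions keep roughly the original 9.6% positive rate
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```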
sns.pairplot(data.iloc[:, 1:])
<seaborn.axisgrid.PairGrid at 0x1ac7d859af0>
Observation:
There is no clear relationship between ZIPCode and the other variables. There is a strong, linear relationship between Age and Experience (as discussed in the sections above). Income and CCAvg are moderately correlated, as are Income and Mortgage.
data.corr()
| ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Securities_Account | CD_Account | Online | CreditCard | Personal_Loan | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ID | 1.00000 | -0.00847 | -0.00826 | -0.01769 | 0.00224 | -0.01680 | -0.02468 | 0.02146 | -0.01392 | -0.01697 | -0.00691 | -0.00253 | 0.01703 | -0.02480 |
| Age | -0.00847 | 1.00000 | 0.99404 | -0.05527 | -0.03053 | -0.04642 | -0.05201 | 0.04133 | -0.01254 | -0.00044 | 0.00804 | 0.01370 | 0.00768 | -0.00773 |
| Experience | -0.00826 | 0.99404 | 1.00000 | -0.04691 | -0.03075 | -0.05195 | -0.04986 | 0.01403 | -0.01112 | -0.00105 | 0.00973 | 0.01392 | 0.00899 | -0.00831 |
| Income | -0.01769 | -0.05527 | -0.04691 | 1.00000 | -0.03071 | -0.15750 | 0.64598 | -0.18752 | 0.20681 | -0.00262 | 0.16974 | 0.01421 | -0.00239 | 0.50246 |
| ZIPCode | 0.00224 | -0.03053 | -0.03075 | -0.03071 | 1.00000 | 0.02751 | -0.01219 | -0.00827 | 0.00361 | 0.00242 | 0.02167 | 0.02832 | 0.02403 | -0.00297 |
| Family | -0.01680 | -0.04642 | -0.05195 | -0.15750 | 0.02751 | 1.00000 | -0.10927 | 0.06493 | -0.02044 | 0.01999 | 0.01411 | 0.01035 | 0.01159 | 0.06137 |
| CCAvg | -0.02468 | -0.05201 | -0.04986 | 0.64598 | -0.01219 | -0.10927 | 1.00000 | -0.13612 | 0.10990 | 0.01509 | 0.13653 | -0.00361 | -0.00669 | 0.36689 |
| Education | 0.02146 | 0.04133 | 0.01403 | -0.18752 | -0.00827 | 0.06493 | -0.13612 | 1.00000 | -0.03333 | -0.01081 | 0.01393 | -0.01500 | -0.01101 | 0.13672 |
| Mortgage | -0.01392 | -0.01254 | -0.01112 | 0.20681 | 0.00361 | -0.02044 | 0.10990 | -0.03333 | 1.00000 | -0.00541 | 0.08931 | -0.00599 | -0.00723 | 0.14210 |
| Securities_Account | -0.01697 | -0.00044 | -0.00105 | -0.00262 | 0.00242 | 0.01999 | 0.01509 | -0.01081 | -0.00541 | 1.00000 | 0.31703 | 0.01263 | -0.01503 | 0.02195 |
| CD_Account | -0.00691 | 0.00804 | 0.00973 | 0.16974 | 0.02167 | 0.01411 | 0.13653 | 0.01393 | 0.08931 | 0.31703 | 1.00000 | 0.17588 | 0.27864 | 0.31635 |
| Online | -0.00253 | 0.01370 | 0.01392 | 0.01421 | 0.02832 | 0.01035 | -0.00361 | -0.01500 | -0.00599 | 0.01263 | 0.17588 | 1.00000 | 0.00421 | 0.00628 |
| CreditCard | 0.01703 | 0.00768 | 0.00899 | -0.00239 | 0.02403 | 0.01159 | -0.00669 | -0.01101 | -0.00723 | -0.01503 | 0.27864 | 0.00421 | 1.00000 | 0.00280 |
| Personal_Loan | -0.02480 | -0.00773 | -0.00831 | 0.50246 | -0.00297 | 0.06137 | 0.36689 | 0.13672 | 0.14210 | 0.02195 | 0.31635 | 0.00628 | 0.00280 | 1.00000 |
plt.figure(figsize=(15, 7))
plt.title("Correlation of Attributes", size=15)
sns.heatmap(data.corr(), annot=True, linewidths=3, fmt=".3f", center=0)
<AxesSubplot:title={'center':'Correlation of Attributes'}>
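Given the near-perfect Age–Experience correlation (0.994), a variance inflation factor (VIF) check makes the multicollinearity concrete. A sketch using `variance_inflation_factor` (imported earlier) on synthetic columns shaped like these attributes, not the bank data itself:

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Synthetic columns: Experience is almost a linear function of Age,
# while Income is independent (hypothetical values)
rng = np.random.default_rng(1)
age = rng.uniform(23, 67, 500)
experience = age - 22 + rng.normal(0, 0.5, 500)
income = rng.uniform(8, 224, 500)

X = pd.DataFrame({"const": 1.0, "Age": age, "Experience": experience, "Income": income})

# VIF for each predictor (skipping the constant column)
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(1, X.shape[1])],
    index=X.columns[1:],
)
print(vif.round(1))
```

Age and Experience receive very large VIFs while Income stays near 1, which is the usual signal for dropping one of the collinear pair, as this notebook later does by building a "without Experience" variant.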
Observation:
plt.figure(figsize=(10, 5))
sns.boxplot(x="Education", y="Income", hue="Personal_Loan", data=data)
<AxesSubplot:xlabel='Education', ylabel='Income'>
Observation:
plt.figure(figsize=(10, 5))
sns.boxplot(x="Education", y="Mortgage", hue="Personal_Loan", data=data)
<AxesSubplot:xlabel='Education', ylabel='Mortgage'>
Observation:
plt.figure(figsize=(10, 5))
sns.barplot(x="Personal_Loan", y="CCAvg", data=data)
<AxesSubplot:xlabel='Personal_Loan', ylabel='CCAvg'>
Observation:
Customers with high monthly average spending on credit cards are more likely to take a loan.
plt.figure(figsize=(10, 5))
sns.countplot(x="CD_Account", data=data, hue="Personal_Loan")
<AxesSubplot:xlabel='CD_Account', ylabel='count'>
Observation:
Most customers who do not have a CD account with the bank do not have a loan either; these form the majority. Customers who do hold a CD account take the loan at a noticeably higher rate.
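One way to quantify this kind of observation is a row-normalized crosstab of the two flags. A sketch on a tiny toy frame (hypothetical values, not the bank data):

```python
import pandas as pd

# Toy frame: loan acceptance is much more common among CD-account holders
df = pd.DataFrame(
    {
        "CD_Account":    [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0],
        "Personal_Loan": [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0],
    }
)

# Each row sums to 1: acceptance rate conditional on CD_Account
rates = pd.crosstab(df["CD_Account"], df["Personal_Loan"], normalize="index")
print(rates)
```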
Data Description:
Data Cleaning:
Observations from EDA:
Actions for data pre-processing:
# Saving a copy of the dataset before treating outliers, for logistic regression
data_original = data.copy()
numeric_columns = ["Income", "CCAvg", "Mortgage", "Age"]
# outlier detection using boxplot
plt.figure(figsize=(20, 30))
for i, variable in enumerate(numeric_columns):
plt.subplot(4, 4, i + 1)
plt.boxplot(data[variable], whis=1.5)
plt.tight_layout()
plt.title(variable)
plt.show()
# Check Income greatest values
data.sort_values(by=["Income"], ascending=False).head(10)
| ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Securities_Account | CD_Account | Online | CreditCard | Personal_Loan | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3896 | 3897 | 48 | 24 | 224 | 93940 | 2 | 6.67000 | 1 | 0 | 0 | 1 | 1 | 1 | 0 |
| 4993 | 4994 | 45 | 21 | 218 | 91801 | 2 | 6.67000 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 526 | 527 | 26 | 2 | 205 | 93106 | 1 | 6.33000 | 1 | 271 | 0 | 0 | 0 | 1 | 0 |
| 2988 | 2989 | 46 | 21 | 205 | 95762 | 2 | 8.80000 | 1 | 181 | 1 | 0 | 1 | 0 | 0 |
| 4225 | 4226 | 43 | 18 | 204 | 91902 | 2 | 8.80000 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 677 | 678 | 46 | 21 | 204 | 92780 | 2 | 2.80000 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2278 | 2279 | 30 | 4 | 204 | 91107 | 2 | 4.50000 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 3804 | 3805 | 47 | 22 | 203 | 95842 | 2 | 8.80000 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 2101 | 2102 | 35 | 5 | 203 | 95032 | 1 | 10.00000 | 3 | 0 | 0 | 0 | 0 | 0 | 1 |
| 787 | 788 | 45 | 15 | 202 | 91380 | 3 | 10.00000 | 3 | 0 | 0 | 0 | 0 | 0 | 1 |
# Customers with the same Age (48) and Experience (24) as the top earner, sorted by Income
data.loc[(data["Age"] == 48) & (data["Experience"] == 24)].sort_values(
by=["Income"], ascending=False
).head(10)
| ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Securities_Account | CD_Account | Online | CreditCard | Personal_Loan | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 3896 | 3897 | 48 | 24 | 224 | 93940 | 2 | 6.67000 | 1 | 0 | 0 | 1 | 1 | 1 | 0 |
| 196 | 197 | 48 | 24 | 165 | 93407 | 1 | 5.00000 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 2217 | 2218 | 48 | 24 | 162 | 91355 | 4 | 3.30000 | 2 | 446 | 0 | 1 | 1 | 0 | 1 |
| 4629 | 4630 | 48 | 24 | 148 | 91311 | 2 | 3.30000 | 1 | 0 | 0 | 1 | 1 | 1 | 0 |
| 4167 | 4168 | 48 | 24 | 144 | 94025 | 4 | 3.50000 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1763 | 1764 | 48 | 24 | 134 | 94105 | 1 | 5.00000 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 3360 | 3361 | 48 | 24 | 133 | 90740 | 1 | 5.00000 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1866 | 1867 | 48 | 24 | 90 | 94523 | 1 | 2.60000 | 2 | 334 | 0 | 0 | 1 | 0 | 0 |
| 761 | 762 | 48 | 24 | 84 | 92152 | 3 | 0.70000 | 1 | 166 | 0 | 0 | 1 | 0 | 0 |
| 35 | 36 | 48 | 24 | 81 | 92647 | 3 | 0.70000 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
# CCAvg greatest values
data.sort_values(by=["CCAvg"], ascending=False).head(10)
| ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Securities_Account | CD_Account | Online | CreditCard | Personal_Loan | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 787 | 788 | 45 | 15 | 202 | 91380 | 3 | 10.00000 | 3 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2101 | 2102 | 35 | 5 | 203 | 95032 | 1 | 10.00000 | 3 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2337 | 2338 | 43 | 16 | 201 | 95054 | 1 | 10.00000 | 2 | 0 | 0 | 0 | 0 | 1 | 1 |
| 3943 | 3944 | 61 | 36 | 188 | 91360 | 1 | 9.30000 | 2 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3822 | 3823 | 63 | 33 | 178 | 91768 | 4 | 9.00000 | 3 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1339 | 1340 | 52 | 25 | 180 | 94545 | 2 | 9.00000 | 2 | 297 | 0 | 0 | 1 | 0 | 1 |
| 9 | 10 | 34 | 9 | 180 | 93023 | 1 | 8.90000 | 3 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2769 | 2770 | 33 | 9 | 183 | 91320 | 2 | 8.80000 | 3 | 582 | 0 | 0 | 1 | 0 | 1 |
| 2447 | 2448 | 44 | 19 | 201 | 95819 | 2 | 8.80000 | 1 | 0 | 0 | 0 | 1 | 1 | 0 |
| 917 | 918 | 45 | 20 | 200 | 90405 | 2 | 8.80000 | 1 | 0 | 0 | 0 | 1 | 1 | 0 |
# Mortgage Greatest values
data.sort_values(by=["Mortgage"], ascending=False).head(10)
| ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Securities_Account | CD_Account | Online | CreditCard | Personal_Loan | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2934 | 2935 | 37 | 13 | 195 | 91763 | 2 | 6.50000 | 1 | 635 | 0 | 0 | 1 | 0 | 0 |
| 303 | 304 | 49 | 25 | 195 | 95605 | 4 | 3.00000 | 1 | 617 | 0 | 0 | 0 | 0 | 1 |
| 4812 | 4813 | 29 | 4 | 184 | 92126 | 4 | 2.20000 | 3 | 612 | 0 | 0 | 1 | 0 | 1 |
| 1783 | 1784 | 53 | 27 | 192 | 94720 | 1 | 1.70000 | 1 | 601 | 0 | 0 | 1 | 0 | 0 |
| 4842 | 4843 | 49 | 23 | 174 | 95449 | 3 | 4.60000 | 2 | 590 | 0 | 0 | 0 | 0 | 1 |
| 1937 | 1938 | 51 | 25 | 181 | 95051 | 1 | 3.30000 | 3 | 589 | 1 | 1 | 1 | 0 | 1 |
| 782 | 783 | 54 | 30 | 194 | 92056 | 3 | 6.00000 | 3 | 587 | 1 | 1 | 1 | 1 | 1 |
| 2769 | 2770 | 33 | 9 | 183 | 91320 | 2 | 8.80000 | 3 | 582 | 0 | 0 | 1 | 0 | 1 |
| 4655 | 4656 | 33 | 7 | 188 | 95054 | 2 | 7.00000 | 2 | 581 | 0 | 0 | 0 | 0 | 1 |
| 4345 | 4346 | 26 | 1 | 184 | 94608 | 2 | 4.20000 | 3 | 577 | 0 | 1 | 1 | 1 | 1 |
I see some very high Income values (up to $224K) compared with customers of the same age group and experience; the CCAvg and Mortgage values look plausible. After identifying outliers, we can decide whether to remove or treat them. Here I am not going to treat them: outliers in Income, Mortgage, and average credit-card spending will occur in real scenarios, and I want the model to learn the underlying pattern for such customers.
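For reference, the whisker rule the boxplots above use (Tukey's 1.5 × IQR fences) can be written as a small helper; a sketch on toy data:

```python
import pandas as pd

def iqr_outlier_mask(s: pd.Series, whis: float = 1.5) -> pd.Series:
    """Boolean mask of points beyond the boxplot whiskers (Tukey's rule)."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - whis * iqr) | (s > q3 + whis * iqr)

# Toy series with one extreme value
s = pd.Series([1, 2, 3, 4, 5, 100])
print(int(iqr_outlier_mask(s).sum()))
```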
data
| ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Securities_Account | CD_Account | Online | CreditCard | Personal_Loan | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 49 | 91107 | 4 | 1.60000 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 2 | 45 | 19 | 34 | 90089 | 3 | 1.50000 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 3 | 39 | 15 | 11 | 94720 | 1 | 1.00000 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 35 | 9 | 100 | 94112 | 1 | 2.70000 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 35 | 8 | 45 | 91330 | 4 | 1.00000 | 2 | 0 | 0 | 0 | 0 | 1 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4995 | 4996 | 29 | 3 | 40 | 92697 | 1 | 1.90000 | 3 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4996 | 4997 | 30 | 4 | 15 | 92037 | 4 | 0.40000 | 1 | 85 | 0 | 0 | 1 | 0 | 0 |
| 4997 | 4998 | 63 | 39 | 24 | 93023 | 2 | 0.30000 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4998 | 4999 | 65 | 40 | 49 | 90034 | 3 | 0.50000 | 2 | 0 | 0 | 0 | 1 | 0 | 0 |
| 4999 | 5000 | 28 | 4 | 83 | 92612 | 3 | 0.80000 | 1 | 0 | 0 | 0 | 1 | 1 | 0 |
5000 rows × 14 columns
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 14 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   ID                  5000 non-null   int64
 1   Age                 5000 non-null   int64
 2   Experience          5000 non-null   int64
 3   Income              5000 non-null   int64
 4   ZIPCode             5000 non-null   int64
 5   Family              5000 non-null   int64
 6   CCAvg               5000 non-null   float64
 7   Education           5000 non-null   int64
 8   Mortgage            5000 non-null   int64
 9   Securities_Account  5000 non-null   int64
 10  CD_Account          5000 non-null   int64
 11  Online              5000 non-null   int64
 12  CreditCard          5000 non-null   int64
 13  Personal_Loan       5000 non-null   int64
dtypes: float64(1), int64(13)
memory usage: 547.0 KB
data.head(5)
| ID | Age | Experience | Income | ZIPCode | Family | CCAvg | Education | Mortgage | Securities_Account | CD_Account | Online | CreditCard | Personal_Loan | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 49 | 91107 | 4 | 1.60000 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 2 | 45 | 19 | 34 | 90089 | 3 | 1.50000 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 3 | 39 | 15 | 11 | 94720 | 1 | 1.00000 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | 35 | 9 | 100 | 94112 | 1 | 2.70000 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | 35 | 8 | 45 | 91330 | 4 | 1.00000 | 2 | 0 | 0 | 0 | 0 | 1 | 0 |
data = data.drop(["ID", "ZIPCode"], axis=1)
data.head(5)
| Age | Experience | Income | Family | CCAvg | Education | Mortgage | Securities_Account | CD_Account | Online | CreditCard | Personal_Loan | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | 1 | 49 | 4 | 1.60000 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1 | 45 | 19 | 34 | 3 | 1.50000 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 2 | 39 | 15 | 11 | 1 | 1.00000 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 35 | 9 | 100 | 1 | 2.70000 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 35 | 8 | 45 | 4 | 1.00000 | 2 | 0 | 0 | 0 | 0 | 1 | 0 |
loanwithexperience = data
loanwithoutexperience = data.drop(["Experience"], axis=1)
print("Columns With Experience : ", loanwithexperience.columns)
print("Columns Without Experience : ", loanwithoutexperience.columns)
Columns With Experience : Index(['Age', 'Experience', 'Income', 'Family', 'CCAvg', 'Education',
'Mortgage', 'Securities_Account', 'CD_Account', 'Online', 'CreditCard',
'Personal_Loan'],
dtype='object')
Columns Without Experience : Index(['Age', 'Income', 'Family', 'CCAvg', 'Education', 'Mortgage',
'Securities_Account', 'CD_Account', 'Online', 'CreditCard',
'Personal_Loan'],
dtype='object')
# From the Experience dataframe:
X_Expr = loanwithexperience.drop("Personal_Loan", axis=1)
Y_Expr = loanwithexperience[["Personal_Loan"]]
# From the Without-Experience dataframe:
X_Without_Expr = loanwithoutexperience.drop("Personal_Loan", axis=1)
Y_Without_Expr = loanwithoutexperience[["Personal_Loan"]]
Splitting the data into training and test sets in the ratio 70:30.
# From Experience Dataframe:
X_Expr_train, X_Expr_test, y_Expr_train, y_Expr_test = train_test_split(
X_Expr, Y_Expr, test_size=0.30, random_state=1
)
print("x train data {}".format(X_Expr_train.shape))
print("y train data {}".format(y_Expr_train.shape))
print("x test data {}".format(X_Expr_test.shape))
print("y test data {}".format(y_Expr_test.shape))
x train data (3500, 11)
y train data (3500, 1)
x test data (1500, 11)
y test data (1500, 1)
# From Without Experience Dataframe:
X_train, X_test, y_train, y_test = train_test_split(
X_Without_Expr, Y_Without_Expr, test_size=0.30, random_state=1
)
print("x train data {}".format(X_train.shape))
print("y train data {}".format(y_train.shape))
print("x test data {}".format(X_test.shape))
print("y test data {}".format(y_test.shape))
x train data (3500, 10)
y train data (3500, 1)
x test data (1500, 10)
y test data (1500, 1)
# X_Expr_train, X_Expr_test, y_Expr_train, y_Expr_test
from sklearn import metrics
from sklearn.linear_model import LogisticRegression

logreg_expr_model = LogisticRegression()
logreg_expr_model.fit(X_Expr_train, y_Expr_train)
# Predicting for test set
logreg_expr_y_predicted = logreg_expr_model.predict(X_Expr_test)
logreg_expr_score = logreg_expr_model.score(X_Expr_test, y_Expr_test)
logreg_expr_accuracy = accuracy_score(y_Expr_test, logreg_expr_y_predicted)
logestic_confusion_matrix_expr = metrics.confusion_matrix(
y_Expr_test, logreg_expr_y_predicted
)
# X_train, X_test, y_train, y_test
logreg_model = LogisticRegression()
logreg_model.fit(X_train, y_train)
# Predicting for test set
logreg_y_predicted = logreg_model.predict(X_test)
logreg_score = logreg_model.score(X_test, y_test)
logreg_accuracy = accuracy_score(y_test, logreg_y_predicted)
logestic_confusion_matrix = metrics.confusion_matrix(y_test, logreg_y_predicted)
# Accuracy
print("Logistic Regression Model Accuracy Score W/O Experience : %f" % logreg_accuracy)
print(
"Logistic Regression Model Accuracy Score With Experience : %f"
% logreg_expr_accuracy
)
# Confusion Matrix
print(
"\nLogistic Regression Confusion Matrix W/O Experience: \n",
logestic_confusion_matrix,
)
print("\nTrue Positive = ", logestic_confusion_matrix[1][1])
print("True Negative = ", logestic_confusion_matrix[0][0])
print("False Positive = ", logestic_confusion_matrix[0][1])
print("False Negative = ", logestic_confusion_matrix[1][0])
print(
"Logistic Regression Confusion Matrix With Experience: ",
logestic_confusion_matrix_expr,
)
print("\nTrue Positive = ", logestic_confusion_matrix_expr[1][1])
print("True Negative = ", logestic_confusion_matrix_expr[0][0])
print("False Positive = ", logestic_confusion_matrix_expr[0][1])
print("False Negative = ", logestic_confusion_matrix_expr[1][0])
Logistic Regression Model Accuracy Score W/O Experience : 0.940000
Logistic Regression Model Accuracy Score With Experience : 0.944000

Logistic Regression Confusion Matrix W/O Experience: 
 [[1339   12]
 [  78   71]]

True Positive =  71
True Negative =  1339
False Positive =  12
False Negative =  78
Logistic Regression Confusion Matrix With Experience:  [[1334   17]
 [  67   82]]

True Positive =  82
True Negative =  1334
False Positive =  17
False Negative =  67
Observation:
Predicting that a customer will take the loan when in reality they would not have - a loss of resources (false positive).
Predicting that a customer will not take the loan when in reality they would have - a loss of opportunity (false negative).
Recall should therefore be maximized: the higher the recall, the fewer the false negatives.
from sklearn import preprocessing
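As a quick numeric illustration of why accuracy alone can mislead here, a toy example with made-up labels (not drawn from this dataset):

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels: 10 customers, only 2 of whom actually take the loan
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
# A model that predicts "no loan" for everyone misses both buyers
y_pred = np.zeros(10, dtype=int)

print(accuracy_score(y_true, y_pred))  # 0.8 -- looks respectable
print(recall_score(y_true, y_pred))    # 0.0 -- every potential buyer was missed
```

High accuracy with zero recall is exactly the failure mode the campaign needs to avoid, which is why recall is tracked throughout the models below.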
# X_Expr_train, X_Expr_test, y_Expr_train, y_Expr_test
# Fit the scaler on the training data only and reuse its statistics on the
# test data, so the test set is not scaled independently of the training set
scaler = preprocessing.StandardScaler().fit(X_Expr_train)
X_train_scaled = scaler.transform(X_Expr_train)
X_test_scaled = scaler.transform(X_Expr_test)
scaled_logreg_model = LogisticRegression()
scaled_logreg_model.fit(X_train_scaled, y_Expr_train)
# Predicting for test set
scaled_logreg_y_predicted = scaled_logreg_model.predict(X_test_scaled)
scaled_logreg_model_score = scaled_logreg_model.score(X_test_scaled, y_Expr_test)
scaled_logreg_accuracy = accuracy_score(y_Expr_test, scaled_logreg_y_predicted)
scaled_logreg_confusion_matrix = metrics.confusion_matrix(
y_Expr_test, scaled_logreg_y_predicted
)
print(
    "After Scaling Logistic Regression Model Accuracy Score with Experience: %f"
    % scaled_logreg_accuracy
)
print(
    "\nAfter Scaling Logistic Regression Confusion Matrix With Experience: \n",
    scaled_logreg_confusion_matrix,
)
print("\nTrue Positive = ", scaled_logreg_confusion_matrix[1][1])
print("True Negative = ", scaled_logreg_confusion_matrix[0][0])
print("False Positive = ", scaled_logreg_confusion_matrix[0][1])
print("False Negative = ", scaled_logreg_confusion_matrix[1][0])
print(
    "\nLogistic Regression Classification Report : \n",
    metrics.classification_report(y_Expr_test, scaled_logreg_y_predicted),
)
conf_table = scaled_logreg_confusion_matrix
a = (conf_table[0, 0] + conf_table[1, 1]) / (
conf_table[0, 0] + conf_table[0, 1] + conf_table[1, 0] + conf_table[1, 1]
)
p = conf_table[1, 1] / (conf_table[1, 1] + conf_table[0, 1])
r = conf_table[1, 1] / (conf_table[1, 1] + conf_table[1, 0])
f = (2 * p * r) / (p + r)
print("Accuracy of accepting Loan : ", round(a, 2))
print("precision of accepting Loan : ", round(p, 2))
print("recall of accepting Loan : ", round(r, 2))
print("F1 score of accepting Loan : ", round(f, 2))
After Scaling Logistic Regression Model Accuracy Score with Experience: 0.947333

After Scaling Logistic Regression Confusion Matrix With Experience: 
 [[1333   18]
 [  61   88]]

True Positive =  88
True Negative =  1333
False Positive =  18
False Negative =  61

Logistic Regression Classification Report : 
precision recall f1-score support
0 0.96 0.99 0.97 1351
1 0.83 0.59 0.69 149
accuracy 0.95 1500
macro avg 0.89 0.79 0.83 1500
weighted avg 0.94 0.95 0.94 1500
Accuracy of accepting Loan : 0.95
precision of accepting Loan : 0.83
recall of accepting Loan : 0.59
F1 score of accepting Loan : 0.69
# Creating a list of odd K values from 1 to 19 for KNN
from sklearn.neighbors import KNeighborsClassifier

numberList = list(range(1, 20))
neighbors = list(
    filter(lambda x: x % 2 != 0, numberList)
)  # subsetting just the odd ones
# Declaring an empty list that will hold the accuracy scores
ac_scores = []
# computing the accuracy for k = 1, 3, 5, ..., 19
for k in neighbors:
    knn = KNeighborsClassifier(n_neighbors=k)
    # fit the model and predict the response
    knn.fit(X_train, y_train.values.ravel())
    y_pred = knn.predict(X_test)
    # evaluate accuracy
    scores = accuracy_score(y_test, y_pred)
    # append the score to the list
    ac_scores.append(scores)
MSE = [1 - x for x in ac_scores] # changing to misclassification error
# determining best k
optimal_k = neighbors[MSE.index(min(MSE))]
print("Odd Neighbors : \n", neighbors)
print("\nAccuracy Score : \n", ac_scores)
print("\nMisclassification error :\n", MSE)
print("\nThe optimal number of neighbors is k =", optimal_k)
# plot misclassification error vs k
plt.plot(neighbors, MSE)
plt.xlabel("Number of Neighbors K")
plt.ylabel("Misclassification Error")
plt.show()
Odd Neighbors : 
 [1, 3, 5, 7, 9, 11, 13, 15, 17, 19]

Accuracy Score : 
 [0.9086666666666666, 0.9093333333333333, 0.906, 0.9026666666666666, 0.9053333333333333, 0.908, 0.908, 0.9066666666666666, 0.9073333333333333, 0.904]

Misclassification error :
 [0.09133333333333338, 0.09066666666666667, 0.09399999999999997, 0.09733333333333338, 0.09466666666666668, 0.09199999999999997, 0.09199999999999997, 0.09333333333333338, 0.09266666666666667, 0.09599999999999997]

The optimal number of neighbors is k = 3
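The same search over odd k could also be done with cross-validation rather than a single hold-out split; a minimal sketch on synthetic stand-in data (make_classification here is illustrative, not the bank dataset):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data for illustration only
X_toy, y_toy = make_classification(n_samples=500, n_features=8, random_state=1)

# 5-fold cross-validated search over the odd values of k
grid = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(1, 20, 2))},
    scoring="accuracy",
    cv=5,
)
grid.fit(X_toy, y_toy)
print("best k:", grid.best_params_["n_neighbors"])
```

Cross-validation averages the score over several splits, so the chosen k is less sensitive to one particular train/test partition.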
Decision Tree:
X_dt = data.drop("Personal_Loan", axis=1)
y_dt = data["Personal_Loan"]
# Spliting data set
X_train_dt, X_test_dt, y_train_dt, y_test_dt = train_test_split(
X_dt, y_dt, test_size=0.3, random_state=1, stratify=y_dt
)
If the frequency of class A is 10% and the frequency of class B is 90%, then class B becomes the dominant class and the decision tree will be biased toward it.
To handle this imbalanced data set, we can pass a dictionary {0: 0.15, 1: 0.85} to the model to specify the weight of each class, so the decision tree gives more weight to class 1.
class_weight is a hyperparameter for the decision tree classifier.
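A minimal sketch of the class_weight effect on synthetic imbalanced data (the 90/10 split and parameter values below are illustrative choices, not taken from the bank dataset):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with a 90/10 class imbalance
X_toy, y_toy = make_classification(
    n_samples=2000, weights=[0.9, 0.1], flip_y=0.05, random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=1, stratify=y_toy)

plain = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X_tr, y_tr)
weighted = DecisionTreeClassifier(
    max_depth=3, class_weight={0: 0.15, 1: 0.85}, random_state=1
).fit(X_tr, y_tr)

# Weighting the minority class typically trades some precision for recall
print("recall, unweighted:", recall_score(y_te, plain.predict(X_te)))
print("recall, weighted:  ", recall_score(y_te, weighted.predict(X_te)))
```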
## Function to calculate recall score
def get_recall_score(model, predictors, target):
    """
    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    prediction = model.predict(predictors)
    return recall_score(target, prediction)
def confusion_matrix_sklearn(model, predictors, target):
    """
    To plot the confusion_matrix with percentages

    model: classifier
    predictors: independent variables
    target: dependent variable
    """
    y_pred = model.predict(predictors)
    cm = confusion_matrix(target, y_pred)
    labels = np.asarray(
        [
            ["{0:0.0f}".format(item) + "\n{0:.2%}".format(item / cm.flatten().sum())]
            for item in cm.flatten()
        ]
    ).reshape(2, 2)
    plt.figure(figsize=(6, 4))
    sns.heatmap(cm, annot=labels, fmt="")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
model = DecisionTreeClassifier(
criterion="gini", class_weight={0: 0.15, 1: 0.85}, random_state=1
)
model.fit(X_train, y_train)
DecisionTreeClassifier(class_weight={0: 0.15, 1: 0.85}, random_state=1)
confusion_matrix_sklearn(model, X_train, y_train)
decision_tree_perf_train = get_recall_score(model, X_train, y_train)
print("Recall Score:", decision_tree_perf_train)
Recall Score: 1.0
confusion_matrix_sklearn(model, X_test, y_test)
decision_tree_perf_test = get_recall_score(model, X_test, y_test)
print("Recall Score:", decision_tree_perf_test)
Recall Score: 0.8791946308724832
## creating a list of column names
feature_names = X_train.columns.to_list()
plt.figure(figsize=(20, 30))
out = tree.plot_tree(
model,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
# below code will add arrows to the decision tree split if they are missing
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(model, feature_names=feature_names, show_weights=True))
|--- Income <= 98.50 | |--- CCAvg <= 2.95 | | |--- weights: [374.10, 0.00] class: 0 | |--- CCAvg > 2.95 | | |--- CD_Account <= 0.50 | | | |--- CCAvg <= 3.95 | | | | |--- Income <= 81.50 | | | | | |--- Age <= 36.50 | | | | | | |--- Family <= 3.50 | | | | | | | |--- CCAvg <= 3.50 | | | | | | | | |--- Online <= 0.50 | | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | | | |--- Online > 0.50 | | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | | |--- CCAvg > 3.50 | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | |--- Family > 3.50 | | | | | | | |--- weights: [0.60, 0.00] class: 0 | | | | | |--- Age > 36.50 | | | | | | |--- Education <= 1.50 | | | | | | | |--- Income <= 74.50 | | | | | | | | |--- weights: [1.80, 0.00] class: 0 | | | | | | | |--- Income > 74.50 | | | | | | | | |--- CCAvg <= 3.60 | | | | | | | | | |--- Age <= 43.50 | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | | |--- Age > 43.50 | | | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | | | |--- CCAvg > 3.60 | | | | | | | | | |--- weights: [0.60, 0.00] class: 0 | | | | | | |--- Education > 1.50 | | | | | | | |--- Online <= 0.50 | | | | | | | | |--- weights: [1.65, 0.00] class: 0 | | | | | | | |--- Online > 0.50 | | | | | | | | |--- weights: [2.40, 0.00] class: 0 | | | | |--- Income > 81.50 | | | | | |--- Mortgage <= 152.00 | | | | | | |--- Securities_Account <= 0.50 | | | | | | | |--- CCAvg <= 3.05 | | | | | | | | |--- weights: [0.45, 0.00] class: 0 | | | | | | | |--- CCAvg > 3.05 | | | | | | | | |--- CCAvg <= 3.85 | | | | | | | | | |--- Income <= 93.50 | | | | | | | | | | |--- Age <= 45.50 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | | |--- Age > 45.50 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | |--- Income > 93.50 | | | | | | | | | | |--- Family <= 1.50 | | | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | | | |--- Family > 1.50 | | | | | | | | | | | 
|--- weights: [0.15, 0.00] class: 0 | | | | | | | | |--- CCAvg > 3.85 | | | | | | | | | |--- weights: [0.00, 2.55] class: 1 | | | | | | |--- Securities_Account > 0.50 | | | | | | | |--- CCAvg <= 3.30 | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | |--- CCAvg > 3.30 | | | | | | | | |--- weights: [0.45, 0.00] class: 0 | | | | | |--- Mortgage > 152.00 | | | | | | |--- Education <= 1.50 | | | | | | | |--- weights: [0.30, 0.00] class: 0 | | | | | | |--- Education > 1.50 | | | | | | | |--- weights: [0.75, 0.00] class: 0 | | | |--- CCAvg > 3.95 | | | | |--- weights: [6.75, 0.00] class: 0 | | |--- CD_Account > 0.50 | | | |--- CCAvg <= 4.50 | | | | |--- weights: [0.00, 6.80] class: 1 | | | |--- CCAvg > 4.50 | | | | |--- weights: [0.15, 0.00] class: 0 |--- Income > 98.50 | |--- Education <= 1.50 | | |--- Family <= 2.50 | | | |--- Income <= 100.00 | | | | |--- CCAvg <= 4.20 | | | | | |--- weights: [0.45, 0.00] class: 0 | | | | |--- CCAvg > 4.20 | | | | | |--- Securities_Account <= 0.50 | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | |--- Securities_Account > 0.50 | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | |--- Income > 100.00 | | | | |--- Income <= 103.50 | | | | | |--- Securities_Account <= 0.50 | | | | | | |--- weights: [2.10, 0.00] class: 0 | | | | | |--- Securities_Account > 0.50 | | | | | | |--- CCAvg <= 3.06 | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | |--- CCAvg > 3.06 | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | |--- Income > 103.50 | | | | | |--- CreditCard <= 0.50 | | | | | | |--- weights: [45.30, 0.00] class: 0 | | | | | |--- CreditCard > 0.50 | | | | | | |--- weights: [19.65, 0.00] class: 0 | | |--- Family > 2.50 | | | |--- Income <= 108.50 | | | | |--- Family <= 3.50 | | | | | |--- weights: [1.05, 0.00] class: 0 | | | | |--- Family > 3.50 | | | | | |--- Income <= 102.00 | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | |--- Income > 102.00 | | | | | | |--- weights: [0.00, 0.85] class: 1 
| | | |--- Income > 108.50 | | | | |--- Age <= 26.00 | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | |--- Age > 26.00 | | | | | |--- Income <= 113.50 | | | | | | |--- Age <= 57.50 | | | | | | | |--- Income <= 112.00 | | | | | | | | |--- Securities_Account <= 0.50 | | | | | | | | | |--- weights: [0.00, 1.70] class: 1 | | | | | | | | |--- Securities_Account > 0.50 | | | | | | | | | |--- weights: [0.00, 1.70] class: 1 | | | | | | | |--- Income > 112.00 | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | |--- Age > 57.50 | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | |--- Income > 113.50 | | | | | | |--- weights: [0.00, 41.65] class: 1 | |--- Education > 1.50 | | |--- Income <= 116.50 | | | |--- CCAvg <= 2.80 | | | | |--- Income <= 106.50 | | | | | |--- weights: [5.40, 0.00] class: 0 | | | | |--- Income > 106.50 | | | | | |--- Age <= 57.50 | | | | | | |--- Age <= 41.50 | | | | | | | |--- Mortgage <= 51.50 | | | | | | | | |--- CCAvg <= 1.55 | | | | | | | | | |--- weights: [1.05, 0.00] class: 0 | | | | | | | | |--- CCAvg > 1.55 | | | | | | | | | |--- CCAvg <= 1.75 | | | | | | | | | | |--- weights: [0.00, 1.70] class: 1 | | | | | | | | | |--- CCAvg > 1.75 | | | | | | | | | | |--- CCAvg <= 2.40 | | | | | | | | | | | |--- weights: [1.35, 0.00] class: 0 | | | | | | | | | | |--- CCAvg > 2.40 | | | | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | | |--- Mortgage > 51.50 | | | | | | | | |--- Online <= 0.50 | | | | | | | | | |--- weights: [0.75, 0.00] class: 0 | | | | | | | | |--- Online > 0.50 | | | | | | | | | |--- weights: [0.90, 0.00] class: 0 | | | | | | |--- Age > 41.50 | | | | | | | |--- CCAvg <= 0.35 | | | | | | | | |--- weights: [0.45, 0.00] class: 0 | | | | | | | |--- CCAvg > 0.35 | | | | | | | | |--- Online <= 0.50 | | | | | | | | | |--- weights: [0.15, 0.00] class: 0 | | | | | | | | |--- Online > 0.50 | | | | | | | | | |--- Mortgage <= 206.00 | | | | | | | | | | |--- CCAvg <= 1.85 | | | | | | | | | | | |--- weights: 
[0.30, 0.00] class: 0 | | | | | | | | | | |--- CCAvg > 1.85 | | | | | | | | | | | |--- truncated branch of depth 3 | | | | | | | | | |--- Mortgage > 206.00 | | | | | | | | | | |--- weights: [0.00, 1.70] class: 1 | | | | | |--- Age > 57.50 | | | | | | |--- weights: [1.35, 0.00] class: 0 | | | |--- CCAvg > 2.80 | | | | |--- Age <= 63.50 | | | | | |--- Family <= 1.50 | | | | | | |--- Online <= 0.50 | | | | | | | |--- weights: [0.00, 1.70] class: 1 | | | | | | |--- Online > 0.50 | | | | | | | |--- weights: [0.45, 0.00] class: 0 | | | | | |--- Family > 1.50 | | | | | | |--- Income <= 99.50 | | | | | | | |--- Family <= 2.50 | | | | | | | | |--- weights: [0.45, 0.00] class: 0 | | | | | | | |--- Family > 2.50 | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | |--- Income > 99.50 | | | | | | | |--- Age <= 60.50 | | | | | | | | |--- Income <= 100.50 | | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | | | |--- Income > 100.50 | | | | | | | | | |--- weights: [0.00, 15.30] class: 1 | | | | | | | |--- Age > 60.50 | | | | | | | | |--- Family <= 3.00 | | | | | | | | | |--- weights: [0.00, 0.85] class: 1 | | | | | | | | |--- Family > 3.00 | | | | | | | | | |--- weights: [0.30, 0.00] class: 0 | | | | |--- Age > 63.50 | | | | | |--- weights: [0.30, 0.00] class: 0 | | |--- Income > 116.50 | | | |--- Securities_Account <= 0.50 | | | | |--- weights: [0.00, 165.75] class: 1 | | | |--- Securities_Account > 0.50 | | | | |--- weights: [0.00, 22.95] class: 1
# importance of features in the tree building ( The importance of a feature is computed as the
# (normalized) total reduction of the criterion brought by that feature. It is also known as the Gini importance )
print(
pd.DataFrame(
model.feature_importances_, columns=["Imp"], index=X_train.columns
).sort_values(by="Imp", ascending=False)
)
                         Imp
Income               0.59844
Family               0.14751
Education            0.12472
CCAvg                0.09312
Age                  0.01276
CD_Account           0.01100
Mortgage             0.00509
Securities_Account   0.00472
Online               0.00265
CreditCard           0.00000
importances = model.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Using GridSearchCV for hyperparameter tuning of the tree model
# Choose the type of classifier.
estimator = DecisionTreeClassifier(random_state=1, class_weight={0: 0.15, 1: 0.85})
# Grid of parameters to choose from
parameters = {
"max_depth": [5, 10, 15, None],
"criterion": ["entropy", "gini"],
"splitter": ["best", "random"],
"min_impurity_decrease": [0.00001, 0.0001, 0.01],
}
# Type of scoring used to compare parameter combinations
from sklearn.metrics import make_scorer

scorer = make_scorer(recall_score)
# Run the grid search
grid_obj = GridSearchCV(estimator, parameters, scoring=scorer, cv=5)
grid_obj = grid_obj.fit(X_train, y_train)
# Set the clf to the best combination of parameters
estimator = grid_obj.best_estimator_
# Fit the best algorithm to the data.
estimator.fit(X_train, y_train)
DecisionTreeClassifier(class_weight={0: 0.15, 1: 0.85}, criterion='entropy',
max_depth=5, min_impurity_decrease=0.01, random_state=1,
splitter='random')
confusion_matrix_sklearn(estimator, X_train, y_train)
decision_tree_tune_perf_train = get_recall_score(estimator, X_train, y_train)
print("Recall Score:", decision_tree_tune_perf_train)
Recall Score: 0.9758308157099698
confusion_matrix_sklearn(estimator, X_test, y_test)
decision_tree_tune_perf_test = get_recall_score(estimator, X_test, y_test)
print("Recall Score:", decision_tree_tune_perf_test)
Recall Score: 0.959731543624161
plt.figure(figsize=(15, 10))
out = tree.plot_tree(
estimator,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(estimator, feature_names=feature_names, show_weights=True))
|--- Income <= 104.35
|   |--- CCAvg <= 3.31
|   |   |--- CD_Account <= 0.92
|   |   |   |--- weights: [373.35, 3.40] class: 0
|   |   |--- CD_Account > 0.92
|   |   |   |--- weights: [13.50, 3.40] class: 0
|   |--- CCAvg > 3.31
|   |   |--- weights: [15.45, 24.65] class: 1
|--- Income > 104.35
|   |--- Education <= 1.67
|   |   |--- Family <= 3.07
|   |   |   |--- Family <= 2.68
|   |   |   |   |--- weights: [64.65, 0.00] class: 0
|   |   |   |--- Family > 2.68
|   |   |   |   |--- weights: [0.75, 29.75] class: 1
|   |   |--- Family > 3.07
|   |   |   |--- weights: [0.00, 15.30] class: 1
|   |--- Education > 1.67
|   |   |--- weights: [7.65, 204.85] class: 1
Observation:
The DecisionTreeClassifier provides parameters such as min_samples_leaf and max_depth to prevent a tree from overfitting. Cost complexity pruning provides another option to control the size of a tree. In DecisionTreeClassifier, this pruning technique is parameterized by the cost complexity parameter, ccp_alpha. Greater values of ccp_alpha increase the number of nodes pruned. Here we only show the effect of ccp_alpha on regularizing the trees and how to choose a ccp_alpha based on validation scores.
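The pruning behaviour described above can be seen on synthetic stand-in data (the alpha values below are arbitrary illustrative choices): the node count is non-increasing as ccp_alpha grows.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data for illustration only
X_toy, y_toy = make_classification(n_samples=500, random_state=1)

# Train the same tree with increasing pruning penalties
counts = [
    DecisionTreeClassifier(ccp_alpha=a, random_state=1)
    .fit(X_toy, y_toy)
    .tree_.node_count
    for a in (0.0, 0.01, 0.05)
]
print(counts)  # node count shrinks as ccp_alpha grows
```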
clf = DecisionTreeClassifier(random_state=1, class_weight={0: 0.15, 1: 0.85})
path = clf.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas, impurities = path.ccp_alphas, path.impurities
pd.DataFrame(path)
| | ccp_alphas | impurities |
|---|---|---|
| 0 | 0.00000 | -0.00000 |
| 1 | 0.00000 | -0.00000 |
| 2 | 0.00000 | -0.00000 |
| 3 | 0.00000 | -0.00000 |
| 4 | 0.00000 | -0.00000 |
| 5 | 0.00000 | -0.00000 |
| 6 | 0.00000 | -0.00000 |
| 7 | 0.00000 | -0.00000 |
| 8 | 0.00000 | -0.00000 |
| 9 | 0.00000 | -0.00000 |
| 10 | 0.00000 | -0.00000 |
| 11 | 0.00000 | -0.00000 |
| 12 | 0.00000 | -0.00000 |
| 13 | 0.00018 | 0.00036 |
| 14 | 0.00026 | 0.00115 |
| 15 | 0.00033 | 0.00214 |
| 16 | 0.00034 | 0.00247 |
| 17 | 0.00034 | 0.00281 |
| 18 | 0.00034 | 0.00315 |
| 19 | 0.00036 | 0.00351 |
| 20 | 0.00037 | 0.00389 |
| 21 | 0.00038 | 0.00427 |
| 22 | 0.00039 | 0.00466 |
| 23 | 0.00039 | 0.00544 |
| 24 | 0.00039 | 0.00583 |
| 25 | 0.00055 | 0.00748 |
| 26 | 0.00055 | 0.00969 |
| 27 | 0.00069 | 0.01038 |
| 28 | 0.00069 | 0.01107 |
| 29 | 0.00077 | 0.01185 |
| 30 | 0.00082 | 0.01593 |
| 31 | 0.00091 | 0.01684 |
| 32 | 0.00094 | 0.01778 |
| 33 | 0.00094 | 0.01967 |
| 34 | 0.00098 | 0.02064 |
| 35 | 0.00101 | 0.02670 |
| 36 | 0.00101 | 0.02771 |
| 37 | 0.00140 | 0.02911 |
| 38 | 0.00164 | 0.03075 |
| 39 | 0.00172 | 0.03247 |
| 40 | 0.00229 | 0.03476 |
| 41 | 0.00274 | 0.03750 |
| 42 | 0.00334 | 0.04084 |
| 43 | 0.00353 | 0.04436 |
| 44 | 0.00514 | 0.04950 |
| 45 | 0.00901 | 0.05851 |
| 46 | 0.01005 | 0.06857 |
| 47 | 0.02253 | 0.09110 |
| 48 | 0.06112 | 0.21334 |
| 49 | 0.25380 | 0.46714 |
fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")
ax.set_xlabel("effective alpha")
ax.set_ylabel("total impurity of leaves")
ax.set_title("Total Impurity vs effective alpha for training set")
plt.show()
Next, I train a decision tree for each of the effective alphas by passing each value to the ccp_alpha parameter of DecisionTreeClassifier. Looping over the alphas array, I then compute the recall on both the train and test parts of the dataset.
clfs = []
for ccp_alpha in ccp_alphas:
    clf = DecisionTreeClassifier(
        random_state=1, ccp_alpha=ccp_alpha, class_weight={0: 0.15, 1: 0.85}
    )
    clf.fit(X_train, y_train)
    clfs.append(clf)
print(
"Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
clfs[-1].tree_.node_count, ccp_alphas[-1]
)
)
Number of nodes in the last tree is: 1 with ccp_alpha: 0.25379571489480957
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]
node_counts = [clf.tree_.node_count for clf in clfs]
depth = [clf.tree_.max_depth for clf in clfs]
fig, ax = plt.subplots(2, 1, figsize=(10, 7))
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")
fig.tight_layout()
recall_train = []
for clf in clfs:
    pred_train = clf.predict(X_train)
    values_train = recall_score(y_train, pred_train)
    recall_train.append(values_train)

recall_test = []
for clf in clfs:
    pred_test = clf.predict(X_test)
    values_test = recall_score(y_test, pred_test)
    recall_test.append(values_test)
train_scores = [clf.score(X_train, y_train) for clf in clfs]
test_scores = [clf.score(X_test, y_test) for clf in clfs]
fig, ax = plt.subplots(figsize=(15, 5))
ax.set_xlabel("alpha")
ax.set_ylabel("Recall")
ax.set_title("Recall vs alpha for training and testing sets")
ax.plot(
ccp_alphas, recall_train, marker="o", label="train", drawstyle="steps-post",
)
ax.plot(ccp_alphas, recall_test, marker="o", label="test", drawstyle="steps-post")
ax.legend()
plt.show()
The maximum recall occurs around alpha = 0.06, but at that alpha the tree is pruned down to little more than a root node and the business rules would be lost. Instead I take alpha = 0.002, where train and test recall are close to each other.
Creating the model with ccp_alpha = 0.002
best_model2 = DecisionTreeClassifier(
ccp_alpha=0.002, class_weight={0: 0.15, 1: 0.85}, random_state=1
)
best_model2.fit(X_train, y_train)
DecisionTreeClassifier(ccp_alpha=0.002, class_weight={0: 0.15, 1: 0.85},
random_state=1)
confusion_matrix_sklearn(best_model2, X_train, y_train)
decision_tree_postpruned_perf_train = get_recall_score(best_model2, X_train, y_train)
print("Recall Score:", decision_tree_postpruned_perf_train)
Recall Score: 0.9667673716012085
confusion_matrix_sklearn(best_model2, X_test, y_test)
decision_tree_postpruned_perf_test = get_recall_score(best_model2, X_test, y_test)
print("Recall Score:", decision_tree_postpruned_perf_test)
Recall Score: 0.9060402684563759
plt.figure(figsize=(15, 10))
out = tree.plot_tree(
best_model2,
feature_names=feature_names,
filled=True,
fontsize=9,
node_ids=False,
class_names=None,
)
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
# Text report showing the rules of a decision tree -
print(tree.export_text(best_model2, feature_names=feature_names, show_weights=True))
|--- Income <= 98.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [374.10, 0.00] class: 0
|   |--- CCAvg > 2.95
|   |   |--- CD_Account <= 0.50
|   |   |   |--- CCAvg <= 3.95
|   |   |   |   |--- Income <= 81.50
|   |   |   |   |   |--- weights: [7.35, 2.55] class: 0
|   |   |   |   |--- Income > 81.50
|   |   |   |   |   |--- weights: [4.35, 9.35] class: 1
|   |   |   |--- CCAvg > 3.95
|   |   |   |   |--- weights: [6.75, 0.00] class: 0
|   |   |--- CD_Account > 0.50
|   |   |   |--- weights: [0.15, 6.80] class: 1
|--- Income > 98.50
|   |--- Education <= 1.50
|   |   |--- Family <= 2.50
|   |   |   |--- Income <= 100.00
|   |   |   |   |--- weights: [0.45, 1.70] class: 1
|   |   |   |--- Income > 100.00
|   |   |   |   |--- weights: [67.20, 0.85] class: 0
|   |   |--- Family > 2.50
|   |   |   |--- weights: [1.65, 45.90] class: 1
|   |--- Education > 1.50
|   |   |--- Income <= 116.50
|   |   |   |--- CCAvg <= 2.80
|   |   |   |   |--- Income <= 106.50
|   |   |   |   |   |--- weights: [5.40, 0.00] class: 0
|   |   |   |   |--- Income > 106.50
|   |   |   |   |   |--- weights: [6.45, 5.95] class: 0
|   |   |   |--- CCAvg > 2.80
|   |   |   |   |--- weights: [1.50, 19.55] class: 1
|   |   |--- Income > 116.50
|   |   |   |--- weights: [0.00, 188.70] class: 1
# importance of features in the tree building ( The importance of a feature is computed as the
# (normalized) total reduction of the 'criterion' brought by that feature. It is also known as the Gini importance )
print(
pd.DataFrame(
best_model2.feature_importances_, columns=["Imp"], index=X_train.columns
).sort_values(by="Imp", ascending=False)
)
                        Imp
Income              0.62627
Family              0.14876
Education           0.13247
CCAvg               0.08068
CD_Account          0.01182
Age                 0.00000
Mortgage            0.00000
Securities_Account  0.00000
Online              0.00000
CreditCard          0.00000
importances = best_model2.feature_importances_
indices = np.argsort(importances)
plt.figure(figsize=(12, 12))
plt.title("Feature Importances")
plt.barh(range(len(indices)), importances[indices], color="violet", align="center")
plt.yticks(range(len(indices)), [feature_names[i] for i in indices])
plt.xlabel("Relative Importance")
plt.show()
Observation:
Test recall is higher for alpha values between 0.002 and 0.005, so alpha = 0.002 was chosen. The similar recall on train (0.967) and test (0.906) indicates a generalized model with reduced false negatives. Important features: Income, Family, Education, CCAvg, and CD_Account. This is the best model of the three, as it keeps false negatives low on the test data while retaining interpretable business rules.
# training performance comparison
models_train_comp_df = pd.DataFrame(
[
decision_tree_perf_train,
decision_tree_tune_perf_train,
decision_tree_postpruned_perf_train,
],
columns=["Recall on training set"],
)
print("Training performance comparison Recall:")
models_train_comp_df
Training performance comparison Recall:
| | Recall on training set |
|---|---|
| 0 | 1.00000 |
| 1 | 0.97583 |
| 2 | 0.96677 |
# testing performance comparison
models_test_comp_df = pd.DataFrame(
[
decision_tree_perf_test,
decision_tree_tune_perf_test,
decision_tree_postpruned_perf_test,
],
columns=["Recall on testing set"],
)
print("Test performance comparison:")
models_test_comp_df
Test performance comparison:
| | Recall on testing set |
|---|---|
| 0 | 0.87919 |
| 1 | 0.95973 |
| 2 | 0.90604 |